Abstract

Statistical language models frequently suffer from a lack of training
data. This problem can be alleviated by clustering, because it reduces
the number of free parameters that need to be trained. However,
clustered models have the following drawback: if there is "enough"
data to train an unclustered model, then the clustered variant may
perform worse. On currently used language modeling corpora, e.g. the
Wall Street Journal corpus, how do the performances of a clustered
and an unclustered model compare? While trying to address this
question, we develop the following two ideas. First, to get a clustering
algorithm with potentially high performance, an existing algorithm
is extended to deal with higher order N-grams. Second, to make it
possible to cluster large amounts of training data more efficiently,
a heuristic to speed up the algorithm is presented. The resulting
clustering algorithm can be used to cluster trigrams on the Wall Street
Journal corpus and the language models it produces can compete with
existing back-off models. Especially when only little training data is
available, the clustered models clearly outperform the back-off
models.

1 Introduction

It is well known that statistical language models often suffer from a lack of
training data. This is true for standard tasks and even more so when one tries
to build a language model for a new domain, because a large corpus of texts
from that domain is usually not available. One frequently used approach to
alleviate this problem is to construct a clustered language model. Because it
has fewer parameters, it needs less training data. The main advantages of a
clustered model are its robustness, even in the face of little or sparse training
data, and its compactness. Particularly when a language model is used during
the acoustic search in a speech recogniser, having a more compact, i.e. less
complex, model can be of considerable importance. The main drawback of
clustered models is that they may perform worse than an unclustered model
if there is "enough" data to train the latter. Do corpora currently used for
language modeling, e.g. the Wall Street Journal corpus, contain enough data
in that sense? Or, in other words, how does the performance of a clustered
model compare with that of an unclustered model? In this paper, we will
attempt to partly answer this question and, along the way, an extended, more
efficient clustering algorithm will be developed.

In the next section (section 2), a brief review of existing clustering
algorithms will be given. For the work presented here, we use the clustering
algorithm proposed in [8], because, in the spirit of decision directed learning,
it uses an optimisation function that is very closely related to the final
performance measure we wish to maximise. Since the algorithm forms the basis
of our work, its optimisation criterion is derived in detail.

In order to achieve a clustered model with potentially high performance,
the algorithm is then extended (section 3) so that it can cluster higher order
N-grams. We present three possible approaches for this extension and then
develop the one chosen for this work in more detail.

When such a clustering algorithm is applied to a large training corpus,
e.g. the Wall Street Journal corpus, with tens of millions of words, the
computational effort required can easily become prohibitive. Therefore, a
simple heuristic to speed up the algorithm is developed in section 4. Its
main idea is as follows. Rather than trying to move each word w to all
possible clusters, as the algorithm requires initially, one only tries moving
w to a fixed number of clusters that have been selected from all possible
clusters by a simple heuristic. This reduces the order of the complexity of
the algorithm. Of course, it may lead to a decrease in performance. However,
in practice, the decrease in performance is minor (less than 5%), whereas the
obtained speedup is large (up to a factor of 32).

Because of the increase in the speed of the algorithm, it can be applied
more easily to the Wall Street Journal corpus. The results obtained are
presented in section 5.



2 Background and Related Work 

In speech recognition, one is given a sequence of acoustic observations A and 
one tries to find the word sequence W* that is most likely to correspond to A. 
In order to minimise the average probability of error, one should, according 
to Bayes' decision rule (|| p. 17]), choose 

$$W^* = \arg\max_W p(W|A). \qquad (1)$$

Based on Bayes' formula (see for example j|, p. 150]), one can rewrite the 
probability from the right hand side of equation 1 according to the following
equation:

$$p(W|A) = \frac{p(W) * p(A|W)}{p(A)}. \qquad (2)$$

$p(W)$ is the probability that the word sequence $W$ is spoken, $p(A|W)$ is the
conditional probability that the acoustic signal $A$ is observed when $W$ is
spoken and $p(A)$ is the probability of observing the acoustic signal $A$. Based
on this formula, one can rewrite the maximization of equation 1 as

$$W^* = \arg\max_W \frac{p(W) * p(A|W)}{p(A)}. \qquad (3)$$

Since $p(A)$ is the same for all $W$, the factor $p(A)$ does not influence the choice
of $W$ and maximising equation 3 is equivalent to maximising

$$W^* = \arg\max_W p(W) * p(A|W). \qquad (4)$$

The component of the speech recogniser that calculates $p(A|W)$ is called the
acoustic model, the component calculating $p(W)$ the language model. With
$W = w_1, ..., w_n$, one can further decompose $p(W)$ using the definition of
conditional probabilities as

$$p(W) = \prod_{i=1}^{n} p(w_i | w_1, ..., w_{i-1}). \qquad (5)$$

In practice, because of the large number of parameters in equation 5, the
probability of $w_i$ is usually assumed to depend only on the immediately
preceding $M$ words:

$$p(w_i | w_1, ..., w_{i-1}) \approx p(w_i | w_{i-M}, ..., w_{i-1}). \qquad (6)$$

These models are called (M + 1)-gram models and in practice, mostly bigram
(M = 1) and trigram (M = 2) models are used. Even in these cases, the
number of parameters that need to be estimated from training data can be
quite large. For a speech recogniser with a vocabulary of 20,000 words, the
bigram needs to estimate roughly $20,000^2 = 4 * 10^8$ parameters and a trigram
$20,000^3 = 8 * 10^{12}$.

One way to alleviate this problem is to use class based models. Let
$G : w \to G(w) = g_w$ be a function that maps each word $w$ to its class
$G(w) = g_w$ and let $|G|$ denote the number of classes. We can then model the
probability of $w_i$ as

$$p(w_i | w_1, ..., w_{i-1}) \approx p_G(w_i | w_{i-M}, ..., w_{i-1}) \qquad (7)$$

$$= p_G(G(w_i) | G(w_{i-M}), ..., G(w_{i-1})) * p_G(w_i | G(w_i)). \qquad (8)$$

Thus, if $|G| = 1000$ classes are being used, the class-based bigram model
(see footnote 1) has $1,000^2 + 20,000 = 1.02 * 10^6$ parameters and the
class-based trigram model $1,000^3 + 20,000 = 1.00002 * 10^9$. This constitutes
a significant reduction in both cases.
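
The parameter counts quoted above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and the counts are the raw table sizes used in the text (no allowance is made for normalisation constraints).

```python
def ngram_params(vocab_size: int, order: int) -> int:
    """Raw size of a word N-gram table: |V|^order entries."""
    return vocab_size ** order


def class_ngram_params(vocab_size: int, num_classes: int, order: int) -> int:
    """|G|^order class N-gram entries plus one membership probability per word."""
    return num_classes ** order + vocab_size


if __name__ == "__main__":
    V, G = 20_000, 1_000  # vocabulary size and number of classes from the text
    for order, name in ((2, "bigram"), (3, "trigram")):
        print(name, ngram_params(V, order), class_ngram_params(V, G, order))
```

For the values in the text, the class-based tables are smaller by factors of roughly 400 (bigram) and 8,000 (trigram).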

Many researchers have developed algorithms for determining the clus- 
tering function G automatically (see for example M, ]5[], M, M and ]T2[). 
Starting from an initial clustering function, the basic principle often is to 
move words from their current cluster to another cluster, but only if this 
improves the value of an optimisation criterion. The algorithms often differ 
in the optimisation criterion and in general, there are many possible choices 
for it. However, in the spirit of decision-directed learning, it makes sense to
use as optimisation criterion a function that is very closely related or
identical to the final performance measure we wish to maximise. This way, one
can be very confident that an improvement in the optimisation criterion will
actually translate to an improvement of performance. We therefore chose
the algorithm proposed in || as the basis for our work. In the following,
the optimisation criterion for a bigram based model (i.e. M = 1) will be
derived, roughly as presented in ||.

Footnote 1: This model is also sometimes referred to as bi-pos model, where
pos stands for Parts Of Speech.

In order to automatically find classification functions $G$, the classification
problem is first converted into an optimisation problem. Suppose the function
$F(G)$ indicates how good the classification $G$ is. One can then reformulate
the classification problem as finding the classification $G^*$ that maximises $F$:

$$G^* = \arg\max_{G \in \mathcal{G}} F(G), \qquad (9)$$

where $\mathcal{G}$ contains the set of possible classifications which are at our disposal.

What is a suitable function $F$, also called optimisation criterion? Given
a classification function $G$, the probabilities $p_G(w|v)$ of equation 8 can be
estimated using the maximum likelihood (ML) estimator, i.e. relative
frequencies:

$$p_G(w|v) = p(G(w)|G(v)) * p(w|G(w)) \qquad (10)$$

$$= \frac{N(G(v), G(w))}{N(G(v))} * \frac{N(G(w), w)}{N(G(w))}, \qquad (11)$$

where $N(x)$ denotes the number of times $x$ occurs in the training data. Given
these probability estimates $p_G(w|v)$, the likelihood $F_{ML}$ of the training data,
i.e. the probability of the training data being generated by our probability
estimates $p_G(w|v)$, measures how well the training data is represented by the
estimates and can be used as optimisation criterion (||). The likelihood of
the training data $F_{ML}$ is simply

$$F_{ML} = \prod_{i=1}^{N} p_G(w_i | w_{i-1}) \qquad (12)$$

$$= \prod_{i=1}^{N} \frac{N(g_{w_{i-1}}, g_{w_i})}{N(g_{w_{i-1}})} * \frac{N(g_{w_i}, w_i)}{N(g_{w_i})}. \qquad (13)$$

Assuming that the classification is unique, i.e. that $G$ is a function,
$N(g_{w_i}, w_i) = N(w_i)$ always holds (because $w_i$ always occurs with the same
class $g_{w_i}$). Since one tries to optimise $F_{ML}$ with respect to $G$, any term
that does not depend on $G$ can be removed, because it will not influence the
optimisation. It is thus equivalent to optimise

$$F'_{ML} = \prod_{i=1}^{N} \frac{N(g_{w_{i-1}}, g_{w_i})}{N(g_{w_{i-1}}) * N(g_{w_i})} \qquad (14)$$

$$=: \prod_{i=1}^{N} f(w_{i-1}, w_i). \qquad (15)$$

If, for two pairs $(w_{i-1}, w_i)$ and $(w_{j-1}, w_j)$, $G(w_{i-1}) = G(w_{j-1})$ and
$G(w_i) = G(w_j)$ holds, then $f(w_{i-1}, w_i) = f(w_{j-1}, w_j)$ is also true.
Identical terms can thus be regrouped to obtain

$$F'_{ML} = \prod_{g_1, g_2} \left( \frac{N(g_1, g_2)}{N(g_1) * N(g_2)} \right)^{N(g_1, g_2)}, \qquad (16)$$

where the product is over all possible pairs $(g_1, g_2)$. Because $N(g_1)$ does not
depend on $g_2$ and $N(g_2)$ does not depend on $g_1$, this can again be simplified
to

$$F'_{ML} = \prod_{g_1, g_2} N(g_1, g_2)^{N(g_1, g_2)} * \prod_{g_1} \left( \frac{1}{N(g_1)} \right)^{N(g_1)} * \prod_{g_2} \left( \frac{1}{N(g_2)} \right)^{N(g_2)}. \qquad (17)$$

After taking the logarithm, one obtains the equivalent optimisation criterion
$F''_{ML}$:

$$F''_{ML} = \sum_{g_1, g_2} N(g_1, g_2) * \log(N(g_1, g_2)) - \sum_{g_1} N(g_1) * \log(N(g_1)) - \sum_{g_2} N(g_2) * \log(N(g_2)). \qquad (18)$$

$F''_{ML}$ is the maximum likelihood optimisation criterion that can be used
to find a good classification $G$. However, the problem with this maximum
likelihood criterion is that one first estimates the probabilities $p_G(w|v)$ on
the training data $T$ and then, given $p_G(w|v)$, one evaluates the classification
$G$ on $T$. In other words, both the classification $G$ and the estimator $p_G(w|v)$
are trained on the same data. Thus, there will not be any unseen event, which
leads to an overestimate of the generalisation power of the class based model.
In order to avoid this, a cross-validation technique will be incorporated
directly into the optimisation criterion in section 2.1.
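
As an illustration of equation 18, the following Python sketch evaluates the maximum likelihood criterion for a given word sequence and word-to-cluster map. The function name and the data layout are hypothetical. As a simplifying assumption, N(g_1) and N(g_2) are counted over the predecessor and successor positions of the bigrams, which differs from plain unigram counts only at the two ends of the sequence.

```python
import math
from collections import Counter


def ml_criterion(words, cluster_of):
    """Evaluate the log-domain ML criterion (equation 18) on the bigrams of `words`."""
    pair, left, right = Counter(), Counter(), Counter()
    for v, w in zip(words, words[1:]):
        g1, g2 = cluster_of[v], cluster_of[w]
        pair[(g1, g2)] += 1   # N(g1, g2)
        left[g1] += 1         # N(g1), counted in predecessor position
        right[g2] += 1        # N(g2), counted in successor position

    def xlogx(n):
        return n * math.log(n)

    return (sum(xlogx(n) for n in pair.values())
            - sum(xlogx(n) for n in left.values())
            - sum(xlogx(n) for n in right.values()))


if __name__ == "__main__":
    text = "a b a b a".split()
    print(ml_criterion(text, {"a": 0, "b": 0}))  # everything in one cluster
    print(ml_criterion(text, {"a": 0, "b": 1}))  # one cluster per word
```

On this toy sequence the two-cluster map scores higher than the single cluster, as one would expect when both estimation and evaluation use the same data.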
2.1 Leaving-One-Out Criterion

The basic principle of cross-validation is to split the training data $T$ into
a "retained" part $T_R$ and a "held-out" part $T_H$. One can then use $T_R$ to
estimate the probabilities $p_G(w|v)$ for a given classification $G$, and $T_H$ to
evaluate how well the classification $G$ performs. The so-called
leaving-one-out technique is a special case of cross-validation (|| pp.75]). It
divides the data into $N - 1$ samples as "retained" part and only one sample as
"held-out" part. This is repeated $N - 1$ times, so that each sample is once in
the "held-out" part. The advantage of this approach is that all samples are
used in the "retained" and in the "held-out" part, thus making very efficient
use of the existing data. In other words, the "held-out" part $T_H$ used to
evaluate a classification $G$ is the entire set of data points; but when one
calculates the probability of the $i$-th data point, one assumes that the
probability distributions $p_G(w|v)$ were estimated on all the data except
point $i$.

Let $T_i$ denote the data without the pair $(w_{i-1}, w_i)$ and $p_{G,T_i}(w|v)$ the
probability estimates based on a given classification $G$ and training corpus
$T_i$. Given a particular $T_i$, the probability of the "held-out" part
$(w_{i-1}, w_i)$ is $p_{G,T_i}(w_i | w_{i-1})$. The probability of the complete
corpus, where each pair is in turn considered the "held-out" part, is the
leaving-one-out likelihood $L_{LO}$:

$$L_{LO} = \prod_{i=1}^{N} p_{G,T_i}(w_i | w_{i-1}). \qquad (19)$$

In the following, an optimisation function $F_{LO}$ will be derived by specifying
how $p_{G,T_i}(w_i | w_{i-1})$ is estimated from frequency counts. First,
$p_{G,T_i}(w_i | w_{i-1})$ is rewritten as usual (see equation 8):

$$p_{G,T_i}(w|v) = p_{G,T_i}(G(w)|G(v)) * p_{G,T_i}(w|G(w)) \qquad (20)$$

$$= \frac{p_{G,T_i}(g_1, g_2)}{p_{G,T_i}(g_1)} * \frac{p_{G,T_i}(g_2, w)}{p_{G,T_i}(g_2)}, \qquad (21)$$

where $g_1 = G(v)$ and $g_2 = G(w)$. Now we will specify how each term in
equation 21 is estimated.

As shown before, $p_{G,T_i}(g_2, w) = p_{T_i}(w)$ (if the classification $G$ is a
function) and since $p_{T_i}(w)$ is actually independent of $G$, one can drop it
out of the maximization and thus need not specify an estimate for it.
As will be shown later, one can guarantee that every class $g_1$ and $g_2$ has
been seen at least once in the "retained" part and one can thus use relative
counts as estimates for class uni-grams:

$$p_{G,T_i}(g_1) = \frac{N_{T_i}(g_1)}{N_{T_i}}, \qquad (22)$$

$$p_{G,T_i}(g_2) = \frac{N_{T_i}(g_2)}{N_{T_i}}. \qquad (23)$$

However, in the case of the class bi-gram, one might have to predict
unseen events (see footnote 2). We therefore use the absolute discounting
method (@), where the counts are reduced by a constant value $b < 1$ and where
the gained probability mass is redistributed over unseen events. Let
$n_{0,T_i}$ be the number of unseen pairs $(g_1, g_2)$ and $n_{+,T_i}$ the number of
seen pairs $(g_1, g_2)$. This leads to the following smoothed estimate:

$$p_{G,T_i}(g_1, g_2) = \begin{cases} \dfrac{N_{T_i}(g_1, g_2) - b}{N_{T_i}} & \text{if } N_{T_i}(g_1, g_2) > 0 \\[2ex] \dfrac{n_{+,T_i} * b}{n_{0,T_i} * N_{T_i}} & \text{if } N_{T_i}(g_1, g_2) = 0 \end{cases} \qquad (24)$$

Footnote 2: If $(w_{i-1}, w_i)$ occurs only once in the complete corpus, then
$p_{G,T_i}(w_i | w_{i-1})$ will have to be calculated based on the corpus $T_i$,
which does not contain any occurrences of $(w_{i-1}, w_i)$.

Ideally, one would make $b$ depend on the classification, e.g. use
$b = \frac{n_1}{n_1 + 2 * n_2}$, where $n_1$ and $n_2$ depend on $G$. However, due to
computational reasons, we use, as suggested in M, the empirically determined
constant value $b = 0.75$ during clustering. The probability distribution
$p_{G,T_i}(g_1, g_2)$ will always be evaluated on the "held-out" part
$(w_{i-1}, w_i)$ and with $g_{1,i} = g_{w_{i-1}}$ and $g_{2,i} = g_{w_i}$, one obtains

$$p_{G,T_i}(g_{1,i}, g_{2,i}) = \begin{cases} \dfrac{N_{T_i}(g_{1,i}, g_{2,i}) - b}{N_{T_i}} & \text{if } N_{T_i}(g_{1,i}, g_{2,i}) > 0 \\[2ex] \dfrac{n_{+,T_i} * b}{n_{0,T_i} * N_{T_i}} & \text{if } N_{T_i}(g_{1,i}, g_{2,i}) = 0 \end{cases} \qquad (25)$$

In order to facilitate future regrouping of terms, one can now express the
counts $N_{T_i}$, $N_{T_i}(g_1)$ etc. in terms of the counts of the complete corpus
$T$ as follows:

$$N_{T_i} = N_T - 1 \qquad (26)$$

$$N_{T_i}(g_{1,i}) = N_T(g_{1,i}) - 1 \qquad (27)$$

$$N_{T_i}(g_{2,i}) = N_T(g_{2,i}) - 1 \qquad (28)$$

$$N_{T_i}(g_{1,i}, g_{2,i}) = N_T(g_{1,i}, g_{2,i}) - 1 \qquad (29)$$

$$N_{T_i} = N_T - 1 \qquad (30)$$

$$n_{+,T_i} = \begin{cases} n_{+,T} & \text{if } N_T(g_{1,i}, g_{2,i}) > 1 \\ n_{+,T} - 1 & \text{if } N_T(g_{1,i}, g_{2,i}) = 1 \end{cases} \qquad (31)$$

$$n_{0,T_i} = \begin{cases} n_{0,T} & \text{if } N_T(g_{1,i}, g_{2,i}) > 1 \\ n_{0,T} + 1 & \text{if } N_T(g_{1,i}, g_{2,i}) = 1 \end{cases} \qquad (32)$$

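All the expressions can now be substituted back into equation 19. After
dropping $p_{T_i}(w)$ because it is independent of $G$, one arrives at

$$F_{LO} = \prod_{i=1}^{N} \frac{p_{G,T_i}(g_{1,i}, g_{2,i})}{p_{G,T_i}(g_{1,i}) * p_{G,T_i}(g_{2,i})} \qquad (33)$$

$$= \prod_{g_1, g_2} \left( p_{G,T_i}(g_1, g_2) \right)^{N(g_1, g_2)} * \prod_{g_1} \left( \frac{1}{p_{G,T_i}(g_1)} \right)^{N(g_1)} * \prod_{g_2} \left( \frac{1}{p_{G,T_i}(g_2)} \right)^{N(g_2)}. \qquad (34)$$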



One can now substitute equations 22, 23 and 25, using the counts of the whole
corpus of equations 26 to 32. After having dropped terms independent of
$G$, one obtains

$$F_{LO} = \prod_{g_1, g_2 : N_T(g_1, g_2) > 1} \left( N_T(g_1, g_2) - 1 - b \right)^{N_T(g_1, g_2)} * \left( \frac{(n_{+,T} - 1) * b}{n_{0,T} + 1} \right)^{n_{1,T}} * \prod_{g_1} \left( \frac{1}{N_T(g_1) - 1} \right)^{N_T(g_1)} * \prod_{g_2} \left( \frac{1}{N_T(g_2) - 1} \right)^{N_T(g_2)}, \qquad (35)$$

where $n_{1,T}$ is the number of pairs $(g_1, g_2)$ seen exactly once in $T$ (i.e. the
number of pairs that will be unseen when used as "held-out" part). Taking
the logarithm, we obtain the final optimisation criterion $F'_{LO}$:

$$F'_{LO} = \sum_{g_1, g_2 : N_T(g_1, g_2) > 1} N_T(g_1, g_2) * \log(N_T(g_1, g_2) - 1 - b) + n_{1,T} * \log\left( \frac{b * (n_{+,T} - 1)}{n_{0,T} + 1} \right) - \sum_{g_1} N_T(g_1) * \log(N_T(g_1) - 1) - \sum_{g_2} N_T(g_2) * \log(N_T(g_2) - 1). \qquad (36)$$

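
Equation 36 can be evaluated directly from the class bigram counts of the complete corpus. The Python sketch below is one possible reading of it, not the implementation used for the experiments. The function name and the dictionary layout are hypothetical, and it assumes that the number of possible class pairs is the square of the number of classes (the bigram case, where predecessor and successor share one set of classes).

```python
import math
from collections import Counter


def loo_criterion(pair_counts, num_classes, b=0.75):
    """Evaluate equation 36 from pair_counts[(g1, g2)] = N_T(g1, g2) > 0."""
    left, right = Counter(), Counter()
    for (g1, g2), n in pair_counts.items():
        left[g1] += n    # N_T(g1)
        right[g2] += n   # N_T(g2)
    # The derivation requires that no class occurs only once (log(N - 1)).
    if min(list(left.values()) + list(right.values())) < 2:
        raise ValueError("every class must occur at least twice")

    n_plus = len(pair_counts)                 # pairs seen in T
    n_zero = num_classes ** 2 - n_plus        # pairs unseen in T (assumption)
    n_one = sum(1 for n in pair_counts.values() if n == 1)  # singleton pairs

    total = sum(n * math.log(n - 1 - b) for n in pair_counts.values() if n > 1)
    if n_one:
        if n_plus < 2:
            raise ValueError("need at least two seen pairs when singletons exist")
        total += n_one * math.log(b * (n_plus - 1) / (n_zero + 1))
    total -= sum(n * math.log(n - 1) for n in left.values())
    total -= sum(n * math.log(n - 1) for n in right.values())
    return total


if __name__ == "__main__":
    counts = {(0, 0): 3, (0, 1): 2, (1, 0): 2, (1, 1): 1}
    print(loo_criterion(counts, num_classes=2))
```

The explicit checks mirror the condition stated in the text that every class must be seen at least once in the "retained" part.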

2.2 Clustering Algorithm 

Given the maximization criterion $F'_{LO}$, we use the algorithm in Figure 1 to
find a good clustering function $G$. The algorithm tries to make local changes
by moving words between classes, but only if this improves the value of the
optimisation function. The algorithm will converge because the optimisation
criterion is made up of logarithms of probabilities and thus has an upper limit
and because the value of the optimisation criterion increases in each iteration.
However, the solution found by this greedy algorithm is only locally optimal
and it depends on the starting conditions. Furthermore, since the clustering
of one word affects the future clustering of other words, the order in which
words are moved is important. As suggested in ||, the words are sorted by
the number of times they occur such that the most frequent words, about
which one knows the most, are clustered first. Moreover, infrequent words
(e.g. words with occurrence counts smaller than 5) are not considered for
clustering, because the information they provide is not very reliable. Thus, if
one starts out with an initial clustering in which no cluster occurs only once,
and if one never moves words that occur only once, then one will never have
a cluster which occurs only once. The assumption made earlier, when it was
decided to estimate cluster uni-grams by frequency counts, can therefore
be guaranteed.

We will now determine the complexity of the algorithm. Let $C$ be the
maximal number of clusters for $G$, let $E$ be the number of elements one tries
to cluster (e.g. $E = |V|$), and let $I$ be the number of iterations. When one
moves $w$ from $g_w$ to $g'_w$ in the inner loop, one needs to change the counts
$N(g_w, g_2)$ and $N(g'_w, g_2)$ for all $g_2$. The amount by which the counts need
to be changed is equal to the number of times $w$ occurred with cluster $g_2$.
Since this amount is independent of $g'_w$, one only needs to calculate it once
for each $w$. The amount can then be looked up in constant time within the
loop, thus making the inner loop of order $C$. The inner loop is executed once
for every cluster $w$ can be moved to, thus giving a complexity of the order
of $C^2$. For each $w$, one needs to calculate the number of times $w$ occurred
with all clusters $g_2$. For that, one has to sum up all the bigram counts
$N(w, v) : G(v) = g_2$, which is on the order of $E$, thus giving a complexity of
the order of $E + C^2$. The two outer loops are executed $I$ and $E$ times, thus
giving a total complexity of the order of $I * E * (E + C^2)$.

Algorithm 1: Clustering()

    start with initial clustering function G
    iterate until some convergence criterion is met
    {
        for all w ∈ V
        {
            for all g'_w ∈ G
            {
                calculate the difference in F(G) when w is moved from g_w to g'_w
            }
            move w to the g'_w that results in the biggest improvement in F(G)
        }
    }

End Clustering

Figure 1: The clustering algorithm
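
The exchange loop of Figure 1 can be sketched in Python as follows. This is a deliberately naive illustration under several assumptions: all names are hypothetical, the criterion is the maximum likelihood form of equation 18 rather than the leaving-one-out criterion, the initial clustering is a simple round-robin assignment, and the criterion is recomputed from scratch for every candidate move instead of using the incremental count updates on which the complexity analysis above relies.

```python
import math
from collections import Counter


def criterion(bigrams, cluster_of):
    """Equation 18 from word-bigram counts and a word -> cluster map."""
    pair, left, right = Counter(), Counter(), Counter()
    for (v, w), n in bigrams.items():
        g1, g2 = cluster_of[v], cluster_of[w]
        pair[(g1, g2)] += n
        left[g1] += n
        right[g2] += n

    def xlogx(n):
        return n * math.log(n)

    return (sum(xlogx(n) for n in pair.values())
            - sum(xlogx(n) for n in left.values())
            - sum(xlogx(n) for n in right.values()))


def initial_clustering(bigrams, num_clusters):
    """Sort words by frequency (most frequent first) and assign round-robin."""
    freq = Counter()
    for (v, w), n in bigrams.items():
        freq[v] += n
        freq[w] += n
    words = [w for w, _ in freq.most_common()]
    return words, {w: i % num_clusters for i, w in enumerate(words)}


def exchange_cluster(bigrams, num_clusters, max_iterations=20):
    words, cluster_of = initial_clustering(bigrams, num_clusters)
    best = criterion(bigrams, cluster_of)
    for _ in range(max_iterations):
        moved = False
        for w in words:                      # most frequent words first
            home, target = cluster_of[w], cluster_of[w]
            for g in range(num_clusters):    # try every other cluster
                if g == home:
                    continue
                cluster_of[w] = g
                score = criterion(bigrams, cluster_of)
                if score > best + 1e-9:      # accept strict improvements only
                    best, target = score, g
            cluster_of[w] = target           # keep the best move, if any
            moved = moved or target != home
        if not moved:                        # converged: no word changed cluster
            break
    return cluster_of, best


if __name__ == "__main__":
    counts = {("a", "b"): 3, ("b", "a"): 3, ("c", "d"): 2,
              ("d", "c"): 2, ("a", "d"): 1, ("c", "b"): 1}
    print(exchange_cluster(counts, num_clusters=2))
```

Because a move is accepted only when it strictly increases the criterion, the returned score can never be lower than that of the initial clustering, which is the monotonicity property used in the convergence argument.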

3 Extending the Clustering Algorithm to N-grams

It is well known that a trigram model outperforms a bigram model if there
is sufficient training data. If we want our clustering algorithm to compete
with unclustered models on a corpus like the Wall Street Journal, where the
trigram indeed outperforms the bigram, it therefore seems logical that the
clustering algorithm should be extended to deal with trigrams (and higher
order N-grams) as well. The original clustered bigram model, as derived
in section 2, is

$$p(w_i | w_{i-1}) = p_G(G(w_i) | G(w_{i-1})) * p_G(w_i | G(w_i)). \qquad (37)$$
There are at least three ways of extending the clustering to (M + 1)-grams,
depending on how one models the probability $p(w_i | w_{i-M}, ..., w_{i-1})$:

a) $p_G(G(w_i) | G(w_{i-M}), ..., G(w_{i-1})) * p_G(w_i | G(w_i))$ (38)

b) $p_G(G_{M+1}(w_i) | G_M(w_{i-M}), ..., G_1(w_{i-1})) * p_G(w_i | G_{M+1}(w_i))$ (39)

c) $p_G(G_2(w_i) | G_1(w_{i-M}, ..., w_{i-1})) * p_G(w_i | G_2(w_i))$ (40)

The tradeoff between these models is one of accuracy versus complexity.
Approach a), which only uses one clustering function $G$, could produce

$$|G|^{|V|}$$

different clusterings (for each word in $V$, it can choose one of the $|G|$ clusters).
Approach b), which uses M + 1 different clustering functions, can represent

i=M+l 

E \Gi\ v 

2=0 

different clusterings, including all the clusterings of approach a). Approach
c), which uses one clustering function for the tuples $(w_{i-M}, ..., w_{i-1})$ and
one for $w_i$, can produce

possible clusterings, including all the ones represented by approaches a) and
b). Approach c) therefore has the highest potential for accuracy, as long
as there is sufficient training data. Since the Wall Street Journal corpus is
very large, we decided to use approach c). Please note that for M = 1,
approach a) gives the traditional clustered bigram approach, but approaches
b) and c) (c) collapses to b) for M = 1) are more general than the traditional
model. Moreover, approach c) is referred to in a recent publication ([10]) as
a two-sided (non symmetric) approach.

Similar to the derivation presented in section 2, one can now derive the
optimisation criterion for approach c). However, since it is very similar to
that derivation, only the final formulae will be given here. The
complete derivation is given in appendix A. Let $g_1$ and $g_2$ denote clusters of
$G_1$ and $G_2$ respectively. The optimisation criterion for the extended
algorithm is

$$F'_{LO} = \sum_{g_1, g_2 : N_T(g_1, g_2) > 1} N_T(g_1, g_2) * \log(N_T(g_1, g_2) - 1 - b) + n_{1,T} * \log\left( \frac{b * (n_{+,T} - 1)}{n_{0,T} + 1} \right) - \sum_{g_1} N_T(g_1) * \log(N_T(g_1) - 1) - \sum_{g_2} N_T(g_2) * \log(N_T(g_2) - 1). \qquad (41)$$

Algorithm 2: Clustering()

    start with initial clustering function G
    iterate until some convergence criterion is met
    {
        for all w ∈ V and t ∈ V^M
        {
            for all g'_w ∈ G_2 and g'_t ∈ G_1
            {
                calculate the difference in F(G) when w/t is moved from
                g_w/g_t to g'_w/g'_t
            }
            move the w/t to the g'_w/g'_t that results in the biggest
            improvement in F(G)
        }
    }

End Clustering

Figure 2: The extended clustering algorithm


The corresponding clustering algorithm, which is shown in Figure 2, is a
straightforward extension of the one given in section 2. Its complexity can
be derived as follows. Let $C_{G_1}$ and $C_{G_2}$ be the maximal number of clusters
for $G_1$ and $G_2$, let $E_1$ and $E_2$ be the number of elements $G_1$ and $G_2$ try to
cluster, let $C = \max(C_{G_1}, C_{G_2})$, $E = \max(E_1, E_2)$ and let $I$ be the number
of iterations. When one moves $w$ from $g_w$ to $g'_w$ in the inner loop (the
situation is symmetrical for $t$), one needs to change the counts $N(g_w, g_2)$
and $N(g'_w, g_2)$ for all $g_2 \in G_2$. The amount by which the counts need to be
changed is equal to the number of times $w$ occurs with cluster $g_2$. Since this
amount is independent of $g'_w$, one only needs to calculate it once for each
$w$. The amount can then be looked up in constant time within the loop, thus
making the inner loop of order $C$. The inner loop is executed once for every
cluster $w$ can be moved to, thus giving a complexity of the order of $C^2$. For
each $w$, one needs to calculate the number of times $w$ occurred with all
clusters $g_2$. For that, one has to sum up all the bigram counts
$N(w, t) : G_2(t) = g_2$, which is on the order of $E$, thus giving a complexity of
the order of $E + C^2$. The two outer loops are executed $I$ and $E$ times, thus
giving a total complexity of the order of $I * E * (E + C^2)$. This is almost
identical to the complexity of the bigram clustering algorithm given in
section 2, except that $E$ is now the number of (M + 1)-grams one wishes to
cluster, rather than the number of unigrams (i.e. words of the vocabulary).

4 Speeding up the Algorithm 

If one wants to use the clustering algorithm on large corpora, the complexity
of the algorithm becomes a crucial issue. As shown in the last two sections,
the complexity of the algorithm is $O(I * E * (E + C^2))$, where $C$ is the
maximally allowed number of clusters, $I$ is the number of iterations and
$E$ is the number of elements to cluster ($|V|$ in case of bigrams, the number
of tuples to be clustered in case of the extended algorithm). $C$ crucially
determines the quality of the resulting language model and one would
therefore like to choose it as big as possible. Unfortunately, because the
algorithm is quadratic in $C$, this may be very costly. We therefore developed
the following heuristic to speed up the algorithm.

The factor $C^2$ comes from the fact that one tries to move a word $w$ to
each of the $C$ possible clusters ($O(C)$), and for each of these one has to
calculate the difference in the optimisation function ($O(C)$ again). If, based
on some heuristic, one could select a fixed number $t$ of target clusters, then
one could try moving $w$ only to these $t$ clusters, rather than to all $C$
possible clusters. This may of course lead to the situation where one does not
move a word to the best possible cluster (because it was not selected by
the heuristic), and thus potentially to a decrease in performance. The size of
this decrease depends on the heuristic function used and
we will come back to this issue when we look at the practical results.

The heuristic used in this work is as follows. For each cluster $g_1$, one keeps
track of the $h$ clusters that most frequently co-occur with $g_1$ in the tables
$N(g_1, g_2)$. For example, if $g_1$ is a cluster of $G_1$ (the situation is symmetric
for $G_2$), then the $h$ biggest entries in $N(g_1, g)$ are the $h$ clusters being stored.
When one tries to move a word $w$, one also constructs a list of the $h$ most
frequent clusters that follow $w$ (one can get this for free as part of the factor
$E$ in $(E + C^2)$). One then simply calculates the number of clusters that are in
both lists and takes this as the heuristic score $H(g_1)$. The bigger $H(g_1)$, the
more similar are the distributions of $w$ and $g_1$, and the more likely it is that
$g_1$ is a good target cluster to which $w$ should be moved. Because the length
of the lists is a constant $h$, calculating the heuristic score is also independent
of $C$. One can thus calculate the heuristic score of all $C$ clusters in $O(C)$.
However, once one has decided to move $w$ to a given cluster, one would have
to update the lists containing the $h$ most frequent clusters following each
cluster $g_1$ (the lists might have changed due to the last moving of a word).
Since the update is $O(C)$ for a given $g_1$, the update would again be $O(C^2)$ for
all clusters. In order to avoid this, one can make another approximation at
this point: one only updates the lists for the original and the new cluster
of $w$. The full update of all the lists is only performed after a certain number
$u$ of words have been moved.
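
The selection of target clusters described above can be sketched in Python as follows. The function names and the dictionary layout are hypothetical. For simplicity, the sketch recomputes the top-h list of every cluster on each call, whereas the text describes cached lists that are fully refreshed only after u moves.

```python
import heapq


def top_h(counts, h):
    """Return the set of the h keys with the largest counts."""
    return {g for g, _ in heapq.nlargest(h, counts.items(), key=lambda kv: kv[1])}


def heuristic_targets(word_followers, cluster_followers, h, t):
    """Rank clusters by the overlap of their top-h follower lists with that of w.

    word_followers[g2]         = number of times w is followed by cluster g2
    cluster_followers[g1][g2]  = N(g1, g2)
    Returns the t clusters with the highest heuristic score H(g1).
    """
    word_top = top_h(word_followers, h)
    scores = {g1: len(word_top & top_h(row, h))
              for g1, row in cluster_followers.items()}
    return heapq.nlargest(t, scores, key=lambda g: scores[g])


if __name__ == "__main__":
    followers_of_w = {0: 9, 1: 7, 2: 1}
    followers_of_cluster = {
        0: {0: 5, 1: 4, 2: 1},
        1: {0: 1, 1: 2, 2: 8},
        2: {2: 6},
    }
    print(heuristic_targets(followers_of_w, followers_of_cluster, h=2, t=2))
```

Only the selected t clusters would then be evaluated with the full optimisation criterion, which is where the saving over trying all C clusters comes from.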

To sum up, we can say that one can select $t$ target clusters using the
heuristic in $O(C)$. Following that, one tries moving $w$ to each of these $t$
clusters, which is again $O(C)$. Moreover, several times per iteration (depending
on $u$), one updates the list of most frequent clusters, which is $O(C^2)$. Thus, the
complexity of the heuristic version of the algorithm is $O(I * (E * (E + C) + C^2))$.
The complexity still contains the factor $C^2$, but this time not within the
inner parenthesis. The factor $C^2$ will thus be smaller than $E * (E + C)$, and is
only given for completeness.

We will now present a practical evaluation of the heuristic algorithm. The
heuristic itself is parameterised by $h$, the number of most frequent clusters
one uses to calculate the heuristic score; $t$, the number of best ranked target
clusters one tries to move word $w$ to; and $u$, the number of moved words after
which a full update of the list of most frequent clusters is performed.
In order to evaluate the heuristic for a given set of parameters, one can
simply compare the final value of the optimisation function and the resulting
perplexity of the heuristic algorithm with that of the full algorithm.

Table 1 contains a comparison of the results using approximately one
million words of training data (from the Wall Street Journal corpus) and
values t = 10, h = 10 and u = 1000. The CPU time given was measured
on a DEC alpha workstation (DEC 3000, model 600), which was used in
all the experiments reported in this paper. One can see that the execution
time of the standard algorithm seems indeed quadratic in the number of
clusters, whereas that of the heuristic version seems to be linear. Moreover,
the perplexity of the heuristic version is always within 4% of that of the
standard algorithm, a number which does not seem to increase systematically
with the number of clusters. Furthermore, the speed up of the algorithm
seems to be closely related to the number of clusters divided by t. For
example, in the case of 320 clusters, this ratio is 320/10 = 32 and the
heuristic version is indeed almost 32 times as fast (the speed up is almost
$1 - \frac{1}{32} \approx 0.97$). Judging from the time behaviour of the standard
algorithm, one would expect it to take around 32 hours to run with 1000
clusters, whereas the heuristic algorithm, as will be shown later, only takes
about half an hour (for t = 10).

            Standard Algorithm      Heuristic Version
Clusters    PP      CPU Time        PP (Δ%)       CPU Time (Δ%)
10          746     1:02            746 (0.0)     1:05 (4.8)
20          630     2:28            653 (3.6)     1:35 (-36)
40          548     7:46            558 (1.8)     2:36 (-67)
80          477     26:41           490 (2.7)     4:30 (-83)
160         421     1:33:29         437 (3.8)     7:59 (-91)
320         394     5:20:18         402 (2.0)     14:17 (-96)

Table 1: Comparison of the algorithm with its heuristic version
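
The relative changes in CPU time reported in Table 1 can be recomputed from the raw timings. The sketch below assumes that the times are given as m:ss or h:mm:ss, which is an interpretation of the table and is not stated in the text; the helper names are hypothetical.

```python
def seconds(clock: str) -> int:
    """Convert 'm:ss' or 'h:mm:ss' into seconds."""
    total = 0
    for part in clock.split(":"):
        total = total * 60 + int(part)
    return total


def relative_change(standard: str, heuristic: str) -> float:
    """Change in CPU time of the heuristic version, in percent of the standard time."""
    base = seconds(standard)
    return 100.0 * (seconds(heuristic) - base) / base


# (standard CPU time, heuristic CPU time) per number of clusters, from Table 1
TABLE_1 = {
    10: ("1:02", "1:05"),
    20: ("2:28", "1:35"),
    40: ("7:46", "2:36"),
    80: ("26:41", "4:30"),
    160: ("1:33:29", "7:59"),
    320: ("5:20:18", "14:17"),
}

if __name__ == "__main__":
    for clusters, (std, heu) in TABLE_1.items():
        print(clusters, round(relative_change(std, heu), 1))
```

Under this reading of the time format, the computed values agree with the Δ% column of the table after rounding.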

Tables 2 to 4 contain a more detailed analysis of the influence of the
parameters t, u, and h on the heuristic version of the algorithm, this time
with a maximal number of allowed clusters of 1000. The first point to note is
that in all tables, a change in the value of the optimisation function is very
closely related to a change in perplexity. This is a very reassuring finding,
because it indicates that the clustering algorithm actually tries to optimise
the correct criterion.

From table 2, one can see that an increase in t leads to an increase in
execution time, but also to an increase in performance. This is because as t
increases, the chance of the heuristic missing the overall best target cluster
for a given word w decreases.

In table 3, one can see that the effect of u on the algorithm is very minor.
This could be explained by the fact that even though the full lists of most
frequent clusters are not updated at every move, the update in clusters $g_w$
and $g'_w$, which is performed at every move, contains the most important
changes.

t     opt           PP     time
5     -1.246e+07    359    20:24
10    -1.243e+07    354    31:35
20    -1.241e+07    350    55:32
40    -1.240e+07    349    1:34:16
80    -1.239e+07    348    2:51:19

Table 2: Results for u = 1000 and h = 10

u       opt           PP     time
4       -1.243e+07    352    36:27
20      -1.243e+07    353    34:00
100     -1.243e+07    353    34:00
500     -1.243e+07    354    34:00
2500    -1.243e+07    354    32:38

Table 3: Results for t = 10 and h = 10

Finally, in Table 4, one can see that the performance of the algorithm 
decreases with an increase in h. This somewhat counterintuitive result could 
be explained by the following hypothesis. If the suitability of a target cluster 
is determined by a small number of very frequently co-occurring clusters, then 
increasing h could make the heuristic perform worse, because the effect of 
the most important clusters is perturbed by a large number of less important 
clusters (the heuristic only counts the number of clusters in both lists and 
does not weigh them). 
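
The unweighted overlap count mentioned above can be illustrated with a small sketch. It is not the original implementation: the function names, the reading of h as the length of each "most frequent clusters" list, and the reading of t as the number of candidate target clusters kept are assumptions made for illustration only.

```python
from collections import Counter

def top_h(cooc_counts, h):
    """Return the set of the h most frequent co-occurring clusters."""
    return {c for c, _ in Counter(cooc_counts).most_common(h)}

def candidate_targets(word_cooc, cluster_cooc, h, t):
    """Rank target clusters by unweighted list overlap and keep the best t.

    word_cooc:    {cluster: count} for the word being moved (hypothetical input)
    cluster_cooc: {target: {cluster: count}} for every candidate target
    """
    word_list = top_h(word_cooc, h)
    scores = {
        target: len(word_list & top_h(counts, h))
        for target, counts in cluster_cooc.items()
    }
    # Highest overlap first; ties are broken by target id for determinism.
    ranked = sorted(scores, key=lambda target: (-scores[target], target))
    return ranked[:t]

word_cooc = {1: 50, 2: 40, 3: 1, 4: 1}
cluster_cooc = {
    "A": {1: 30, 2: 20, 9: 1},
    "B": {3: 5, 4: 5, 7: 4, 1: 1},
}
best_small_h = candidate_targets(word_cooc, cluster_cooc, h=2, t=1)
best_large_h = candidate_targets(word_cooc, cluster_cooc, h=4, t=1)
```

In this toy input, target A shares the two dominant clusters with the word, yet with h = 4 target B wins on rare clusters alone, which is the kind of perturbation the hypothesis describes.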

Based on the results of these experiments, we chose h = 5, t = 10 and 
u = 1000 for future experiments with the heuristic version of the algorithm. 



 h    opt           PP    time
 5    -1.243e+07    353   31:09
10    -1.243e+07    353   33:09
20    -1.245e+07    357   36:26
40    -1.254e+07    370   41:50

Table 4: Results for t = 10 and u = 10 
5 Results 



In the following, we will present results of clustered language models on the 
Wall Street Journal (WSJ) corpus. The work reported here was performed on 
the WSJ0 corpus, using the verbalised pronunciation (VP) and non-verbalised 
pronunciation (NVP) versions of the corpus with the 20K open vocabulary. 
As mentioned on the CDROM (and as discussed in [PUD, the results for open 
vocabularies are usually not meaningful, if the unknown words are taken 
into account when calculating the perplexity. One way to solve this problem 
([13]) is to simply skip the unknown words when calculating the perplexity. 
Since all our experiments were performed with the open vocabulary, this 
is the approach taken here, except when indicated otherwise. In order to 
investigate the influence of the amount of training data on the results, we 
used seven different sets of training data, T1 to T7, with about 2K, 12K, 
60K, 350K, 1.7M, 8.5M and 40M words respectively. All perplexity results 
were calculated on approximately 2.3 million words of text that were not 
part of the training material. The clustered models were produced with the 
extended heuristic version of the algorithm. To run to completion, it took 
less than 12 hours real time for the bigram case, and several days for the 
trigram case. 
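
As an illustration of the evaluation convention described above, the following sketch computes perplexity while skipping unknown words. The function signature and the `prob(word, history)` callback are hypothetical. Whether an unknown word may still appear in the history of a later known word is not specified in the text, so this sketch leaves histories untouched.

```python
import math

def perplexity(events, prob, vocab, skip_unknown=True):
    """Perplexity over (history, word) events.

    prob(word, history) must return a probability > 0.
    With skip_unknown=True, words outside vocab contribute neither to the
    log-probability sum nor to the word count.
    """
    log_sum, n = 0.0, 0
    for history, word in events:
        if skip_unknown and word not in vocab:
            continue
        log_sum += math.log(prob(word, history))
        n += 1
    if n == 0:
        raise ValueError("no scorable words in the test data")
    return math.exp(-log_sum / n)

# Toy check: a uniform model over four words has perplexity 4.
events = [((), "a"), ((), "b"), ((), "zzz")]
pp = perplexity(events, lambda w, h: 0.25, vocab={"a", "b"})
```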

As a yardstick for the performance of the clustered models, we implemented 
the commonly used compact back-off model (0]. Because the bigram counts 
were not smoothed, the probability mass that could be redistributed to 
unseen events was gained only through events that fell below the cut-off 
threshold. If a given cut-off threshold did not lead to any gained 
probability mass for a particular distribution (because no event was below 
the threshold and thus no probability mass could be redistributed), the 
cut-off threshold of this distribution was set to the lowest value that would 
lead to some gain in probability mass. Tables 5 and 6 give the perplexity 
of back-off models with various cut-off thresholds C for verbalised and 
non-verbalised pronunciation respectively. First, one can see that a bigger 
value of C leads to a higher perplexity. This is because as C increases, more 
and more bigram counts are discarded and replaced by unigram, rather than 
bigram, probability estimates. However, a good reason why higher values of C 
might still be of interest is that they lead to substantially smaller models, 
and this can be of crucial importance for the time performance of a recogniser. 
Second, and more importantly for our purposes, the results seem comparable 
Training Set   C = 250   C = 50   C = 10   C = 2
T1             2180      2180     2150     2240
T2             1350      1320     1210     1130
T3             1030       936      750      621
T4              812       645      480      373
T5              620       462      330      254
T6              456       323      236      190
T7              324       233      182      159

Table 5: Back-off perplexity results on VP data 



Training Set   C = 250   C = 50   C = 10   C = 2
T1             3170      3170     3150     3220
T2             1980      1950     1790     1610
T3             1440      1310     1060      878
T4             1170       936      696      537
T5              928       693      488      369
T6              666       466      332      265
T7              464       327      250      216

Table 6: Back-off perplexity results on NVP data 
Training   back-off   clustered   improvement (%)
T1         2240       1750         22
T2         1130        831         27
T3          621        515         17
T4          373        324         13
T5          254        231          9.1
T6          190        188          1.1
T7          159        172         -8.2

Table 7: Perplexity results for (2000,2000) bigram clusters (VP) 

Training   back-off   clustered   improvement (%)
T1         3220       2420         25
T2         1610       1230         24
T3          878        762         13
T4          537        491          8.6
T5          369        345          6.5
T6          265        267         -0.76
T7          216        240        -11

Table 8: Perplexity results for (2000,2000) bigram clusters (NVP) 



to other results reported in the literature. In [1], for example, the perplexity 
result for the non-verbalised data with open vocabulary is 205, quite close 
to our 216 (for C = 2). However, it is quite likely that the probabilities of 
unknown words were taken into account for the calculation of the 205 value, 
and our model also gives a perplexity of 205 in that case. The back-off results 
of Tables 5 and 6 therefore constitute a reasonable yardstick to evaluate the 
performance of the clustered language models. 
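
The cut-off mechanism of the back-off yardstick can be sketched as follows. This is a simplified illustration rather than the implementation used for the experiments: it assumes that "below the threshold" means a count strictly smaller than C, that the freed mass is spread over the discarded and unseen words in proportion to their unigram probabilities, and that histories never seen in training fall back to the plain unigram. All names are hypothetical.

```python
from collections import defaultdict

def backoff_bigram(bigram_counts, unigram_probs, cutoff):
    """Return prob(w, v) = p(w | v) for a back-off bigram with unsmoothed counts.

    bigram_counts: {(v, w): count}; unigram_probs: {w: p(w)}.
    """
    by_history = defaultdict(dict)
    for (v, w), n in bigram_counts.items():
        by_history[v][w] = n

    def prob(w, v):
        counts = by_history.get(v)
        if not counts:
            return unigram_probs[w]          # unseen history: pure unigram
        total = sum(counts.values())
        c = cutoff
        if min(counts.values()) >= c:
            # No event falls below the cut-off, so raise it to the lowest
            # value that frees some probability mass.
            c = min(counts.values()) + 1
        kept = {x: n for x, n in counts.items() if n >= c}
        if w in kept:
            return kept[w] / total
        gained = 1.0 - sum(kept.values()) / total
        norm = sum(p for x, p in unigram_probs.items() if x not in kept)
        return gained * unigram_probs[w] / norm

    return prob

counts = {("a", "b"): 8, ("a", "c"): 2}
unigrams = {"a": 0.5, "b": 0.3, "c": 0.2}
p5 = backoff_bigram(counts, unigrams, cutoff=5)
p1 = backoff_bigram(counts, unigrams, cutoff=1)
```

With cutoff=1 no event lies below the threshold, so the sketch raises it to 3 for this history and the rarest event is discarded, as described in the text.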

Tables 7 and 8 give the results of a clustered bigram with 2000 clusters 
for both $G_1$ and $G_2$, for verbalised and non-verbalised pronunciation 
respectively. For better comparison, the matching results of the back-off 
models are repeated and the difference is given in percent. Even though the 
clustered model performs worse than the back-off model on the largest set of 
data, it outperforms the back-off model in almost all other cases. This clearly 
shows the superior robustness of the clustered models. 
Training   clustered bigram   clustered trigram   improvement (%)
           (3000,3000)        (7000,1000)
5.2M       191                208                 -8.9
41M        167                151                  9.6

Table 9: Results for bigram and trigram clusters (VP) 



Table 9 shows the results for a clustered trigram with 7000 and 1000 
clusters for $G_1$ and $G_2$ on VP data. Because these results were obtained on 
slightly different training and testing texts, the table also contains the results 
of the clustered bigram on the same data. One can see that the clustered 
trigram outperforms the clustered bigram, at least with sufficient training 
data. But even with only five million words of training data, the clustered 
trigram is only slightly worse than the clustered bigram, showing again the 
robustness of the clustered language models. 

From all the results given here, one can see that the clustered language 
models can still compete with unclustered models, even when a large corpus, 
such as the Wall Street Journal corpus, is being used. 

6 Conclusions 

In this paper, an existing clustering algorithm is extended to deal with higher 
order N-grams. Moreover, a heuristic version of the algorithm is introduced, 
which leads to a very significant speed up (up to a factor of 32), with only a 
slight loss in performance (5%). This makes it possible to apply the result- 
ing algorithm to the clustering of bigrams and trigrams on the Wall Street 
Journal corpus. The results are shown to be comparable to standard back-off 
bigram models. Moreover, in the absence of many million words of training 
data, the clustered model is more robust and clearly outperforms the non- 
clustered models. This is an important point, because for many real world 
speech recognition applications, the amount of training data available for a 
certain task or domain is in general unlikely to exceed several million words. 
In those cases, the clustered models seem like a good alternative to back-off 
models and certainly one that deserves close investigation. 

3 Only the 500,000 most frequent bigrams were clustered using $G_1$. 
The main advantage of the clustered models, their robustness in the face of 
little training data, can also be seen from the results, and in these situations 
the clustered models are preferable to the standard back-off models. 



Appendix A: Deriving the Optimisation Function 

In this appendix, we will present the derivation of the optimisation function 
for the extended clustering algorithm in detail. It is a generalisation of [15], 
where the derivation was given for M = 1. 

Let $G$ be shorthand for both classification functions $G_1$ and 
$G_2$. Following the same approach as in section 0, one can estimate the 
probabilities in equation (40) using the maximum likelihood estimator 

\begin{align}
p_G(w \mid v_M, \ldots, v_1) &= p(G_2(w) \mid G_1(v_M, \ldots, v_1)) \cdot p(w \mid G_2(w)) \tag{42} \\
&= \frac{N(g_1, g_2)}{N(g_1)} \cdot \frac{N(g_2, w)}{N(g_2)}, \tag{43}
\end{align}

where $g_1 = G_1(v_M, \ldots, v_1)$, $g_2 = G_2(w)$ and $N(x)$ denotes the number 
of times $x$ occurs in the data. Given these probability estimates 
$p_G(w \mid v_M, \ldots, v_1)$, the likelihood $F_{ML}$ of the training data, i.e. the 
probability of the training data being generated by our probability estimates 
$p_G(w \mid v_M, \ldots, v_1)$, measures how well the training data is represented 
by the estimates and can be used as optimisation criterion (||). The likelihood 
of the training data $F_{ML}$ is simply 

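\begin{align}
F_{ML} &= \prod_{i=1}^{N} p_G(w_i \mid w_{i-M}, \ldots, w_{i-1}) \tag{44} \\
&= \prod_{i=1}^{N} \frac{N(G_1(w_{i-M}, \ldots, w_{i-1}), G_2(w_i))}{N(G_1(w_{i-M}, \ldots, w_{i-1}))} \cdot \frac{N(G_2(w_i), w_i)}{N(G_2(w_i))}. \tag{45}
\end{align}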

Assuming that the classification is unique, i.e. that $G_1$ and $G_2$ are functions, 
$N(G_2(w_i), w_i) = N(w_i)$ always holds (because $w_i$ always occurs with the same 
class $G_2(w_i)$). Since one is trying to optimise $F_{ML}$ with respect to $G$, one can 
remove any term that does not depend on $G$, because it will not influence 
the optimisation. It is thus equivalent to optimise 

\begin{align}
F'_{ML} &= \prod_{i=1}^{N} \frac{N(G_1(w_{i-M}, \ldots, w_{i-1}), G_2(w_i))}{N(G_1(w_{i-M}, \ldots, w_{i-1}))} \cdot \frac{1}{N(G_2(w_i))} \tag{46} \\
&=: \prod_{i=1}^{N} f(w_{i-M}, \ldots, w_{i-1}, w_i). \tag{47}
\end{align}

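If, for two tuples $(w_{i-M}, \ldots, w_{i-1}, w_i)$ and $(w_{j-M}, \ldots, w_{j-1}, w_j)$, 
$G_1(w_{i-M}, \ldots, w_{i-1}) = G_1(w_{j-M}, \ldots, w_{j-1})$ and $G_2(w_i) = G_2(w_j)$ 
is true, then $f(w_{i-M}, \ldots, w_{i-1}, w_i) = f(w_{j-M}, \ldots, w_{j-1}, w_j)$ also 
holds. One can thus regroup identical terms to obtain 

\[
F'_{ML} = \prod_{g_1, g_2} \left( \frac{N(g_1, g_2)}{N(g_1)\, N(g_2)} \right)^{N(g_1, g_2)}, \tag{48}
\]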

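where the product is over all possible pairs $(g_1, g_2)$. Because $N(g_1)$ does not 
depend on $g_2$ and $N(g_2)$ does not depend on $g_1$, one can simplify this again 
to 

\[
F'_{ML} = \prod_{g_1, g_2} N(g_1, g_2)^{N(g_1, g_2)} \cdot \prod_{g_1} \left( \frac{1}{N(g_1)} \right)^{N(g_1)} \cdot \prod_{g_2} \left( \frac{1}{N(g_2)} \right)^{N(g_2)}. \tag{49}
\]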

Taking the logarithm, one obtains the equivalent optimisation criterion 



\begin{align}
F''_{ML} = {} & \sum_{g_1, g_2} N(g_1, g_2) \log N(g_1, g_2) - \sum_{g_1} N(g_1) \log N(g_1) \tag{50} \\
& - \sum_{g_2} N(g_2) \log N(g_2). \notag
\end{align}
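
As a sanity check on the derivation, the sketch below evaluates the logarithmic criterion from class counts and compares it with the log-likelihood computed directly from the product over training events. The two differ only by the term sum_w N(w) log N(w), which does not depend on the classification. The data layout (a list of (history, word) events and dictionaries for the two classification functions) and all names are assumptions made for illustration.

```python
import math
from collections import Counter

def class_counts(events, G1, G2):
    """Count class pairs, history classes and word classes over all events."""
    n12 = Counter((G1[h], G2[w]) for h, w in events)
    n1 = Counter(G1[h] for h, _ in events)
    n2 = Counter(G2[w] for _, w in events)
    return n12, n1, n2

def ml_criterion(events, G1, G2):
    """Sum of N log N over class pairs minus the two class unigram sums."""
    n12, n1, n2 = class_counts(events, G1, G2)
    xlogx = lambda counts: sum(n * math.log(n) for n in counts.values())
    return xlogx(n12) - xlogx(n1) - xlogx(n2)

def log_likelihood(events, G1, G2):
    """log F_ML evaluated event by event from the class-based estimates."""
    n12, n1, n2 = class_counts(events, G1, G2)
    nw = Counter(w for _, w in events)   # N(g2, w) = N(w) since G2 is a function
    return sum(
        math.log(n12[G1[h], G2[w]] / n1[G1[h]] * nw[w] / n2[G2[w]])
        for h, w in events
    )

# Toy bigram corpus (M = 1): each history is a 1-tuple holding the previous word.
words = "a b a c b a b c".split()
events = [((words[i - 1],), words[i]) for i in range(1, len(words))]
G1 = {("a",): 0, ("b",): 0, ("c",): 1}
G2 = {"a": 0, "b": 1, "c": 1}
constant = sum(n * math.log(n) for n in Counter(w for _, w in events).values())
```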
$F'_{ML}$ is the maximum likelihood optimisation criterion which could be 
used to find a good classification $G$. However, the problem with this 
maximum likelihood criterion is the same as the one discussed in the main 
text. In the following, a leaving-one-out criterion is therefore developed. 

Let $T_i$ denote the data without the pair $(w_{i-M}, \ldots, w_{i-1}, w_i)$ and 
$p_{G,T_i}(w \mid v_M, \ldots, v_1)$ the probability estimates based on a given 
classification $G$ and training corpus $T_i$. Given a particular $T_i$, the 
probability of the "held-out" part $(w_{i-M}, \ldots, w_i)$ is 
$p_{G,T_i}(w_i \mid w_{i-M}, \ldots, w_{i-1})$. The probability of the complete corpus, 
where each pair is in turn considered the "held-out" part, is the 
leaving-one-out likelihood $L_{LO}$: 

\[
L_{LO} = \prod_{i=1}^{N} p_{G,T_i}(w_i \mid w_{i-M}, \ldots, w_{i-1}). \tag{51}
\]
In the following, we will derive an optimisation function $F_{LO}$ by specifying 
how $p_{G,T_i}(w_i \mid w_{i-M}, \ldots, w_{i-1})$ is estimated from frequency counts. One first 
rewrites $p_{G,T_i}(w_i \mid w_{i-M}, \ldots, w_{i-1})$ as usual (see equation f!0|): 

\begin{align}
p_{G,T_i}(w \mid v_M, \ldots, v_1) &= p_{G,T_i}(G_2(w) \mid G_1(v_M, \ldots, v_1)) \cdot p_{G,T_i}(w \mid G_2(w)) \tag{52} \\
&= \frac{p_{G,T_i}(g_1, g_2)}{p_{G,T_i}(g_1)} \cdot \frac{p_{G,T_i}(g_2, w)}{p_{G,T_i}(g_2)}, \tag{53}
\end{align}

where $g_1 = G_1(v_M, \ldots, v_1)$ and $g_2 = G_2(w)$. Now we will specify how we 
estimate each term in equation (53). 

As before, $p_{G,T_i}(g_2, w)$ can be dropped from the optimisation criterion and 
relative frequencies can be used as estimators for the class unigrams: 

\begin{align}
p_{G,T_i}(g_1) &= \frac{N_{T_i}(g_1)}{N_{T_i}} \tag{54} \\
p_{G,T_i}(g_2) &= \frac{N_{T_i}(g_2)}{N_{T_i}}. \tag{55}
\end{align}

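In the case of the class bigram, one can again use the absolute discounting 
method for smoothing. Let $n_{0,T_i}$ be the number of unseen pairs $(g_1, g_2)$ and 
$n_{+,T_i}$ the number of seen pairs $(g_1, g_2)$, leading to the following smoothed 
estimate 

\[
p_{G,T_i}(g_1, g_2) =
\begin{cases}
\dfrac{N_{T_i}(g_1, g_2) - b}{N_{T_i}} & \text{if } N_{T_i}(g_1, g_2) > 0 \\[2ex]
\dfrac{n_{+,T_i} \cdot b}{n_{0,T_i} \cdot N_{T_i}} & \text{if } N_{T_i}(g_1, g_2) = 0.
\end{cases}
\tag{56}
\]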



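Again, the empirically determined constant value $b = 0.75$ is used during 
clustering. The probability distribution $p_{G,T_i}(g_1, g_2)$ will always be evaluated 
on the "held-out" part $(w_{i-M}, \ldots, w_{i-1}, w_i)$ and with 
$g_{1,i} = G_1(w_{i-M}, \ldots, w_{i-1})$ and $g_{2,i} = G_2(w_i)$ one obtains 

\[
p_{G,T_i}(g_{1,i}, g_{2,i}) =
\begin{cases}
\dfrac{N_{T_i}(g_{1,i}, g_{2,i}) - b}{N_{T_i}} & \text{if } N_{T_i}(g_{1,i}, g_{2,i}) > 0 \\[2ex]
\dfrac{n_{+,T_i} \cdot b}{n_{0,T_i} \cdot N_{T_i}} & \text{if } N_{T_i}(g_{1,i}, g_{2,i}) = 0.
\end{cases}
\tag{57}
\]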
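
A minimal sketch of absolute discounting for class pairs, assuming the standard form in which every seen pair gives up the constant b and the freed mass is shared equally among the unseen pairs. The function and argument names are hypothetical; the check at the end only confirms that the estimate sums to one over all class pairs.

```python
def discounted_pair_prob(pair_counts, num_g1, num_g2, b=0.75):
    """Return prob(g1, g2) with absolute discounting over num_g1 * num_g2 pairs.

    pair_counts: {(g1, g2): count} for the seen class pairs.
    """
    total = sum(pair_counts.values())
    n_plus = sum(1 for n in pair_counts.values() if n > 0)   # seen pairs
    n_zero = num_g1 * num_g2 - n_plus                        # unseen pairs

    def prob(g1, g2):
        n = pair_counts.get((g1, g2), 0)
        if n > 0:
            return (n - b) / total
        # Each unseen pair receives an equal share of the discounted mass.
        return n_plus * b / (n_zero * total)

    return prob

pair_counts = {(0, 0): 3, (0, 1): 1, (1, 1): 2}
p = discounted_pair_prob(pair_counts, num_g1=2, num_g2=2)
mass = sum(p(g1, g2) for g1 in range(2) for g2 in range(2))
```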
In order to facilitate future regrouping of terms, one again expresses the 
counts $N_{T_i}$, $N_{T_i}(g_1)$ etc. in terms of the counts of the complete corpus $T$ as 
follows: 

\begin{align}
N_{T_i} &= N_T - 1 \tag{58} \\
N_{T_i}(g_1) &= N_T(g_1) - 1 \tag{59} \\
N_{T_i}(g_2) &= N_T(g_2) - 1 \tag{60} \\
N_{T_i}(g_{1,i}, g_{2,i}) &= N_T(g_{1,i}, g_{2,i}) - 1 \tag{61} \\
n_{+,T_i} &=
\begin{cases}
n_{+,T} & \text{if } N_T(g_{1,i}, g_{2,i}) > 1 \\
n_{+,T} - 1 & \text{if } N_T(g_{1,i}, g_{2,i}) = 1
\end{cases} \tag{63} \\
n_{0,T_i} &=
\begin{cases}
n_{0,T} & \text{if } N_T(g_{1,i}, g_{2,i}) > 1 \\
n_{0,T} + 1 & \text{if } N_T(g_{1,i}, g_{2,i}) = 1
\end{cases} \tag{64}
\end{align}

After dropping $p_{G,T_i}(g_2, w)$ and substituting the expressions back into 
equation (51), one obtains: 



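\begin{align}
F'_{LO} &= \prod_{i=1}^{N} \frac{p_{G,T_i}(g_{1,i}, g_{2,i})}{p_{G,T_i}(g_{1,i})\, p_{G,T_i}(g_{2,i})} \tag{65} \\
&= \prod_{g_1, g_2} \left( p_{G,T_i}(g_1, g_2) \right)^{N(g_1, g_2)} \cdot \prod_{g_1} \left( \frac{1}{p_{G,T_i}(g_1)} \right)^{N(g_1)} \cdot \prod_{g_2} \left( \frac{1}{p_{G,T_i}(g_2)} \right)^{N(g_2)}. \tag{66}
\end{align}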



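One can now substitute equations (54), (55) and (57), using the counts of the whole 
corpus of equations (58) to (64). After having dropped terms independent of 
$G$, one obtains 

\begin{align}
F_{LO} = {} & \prod_{g_1, g_2 :\, N_T(g_1, g_2) > 1} \left( N_T(g_1, g_2) - 1 - b \right)^{N_T(g_1, g_2)} \cdot \left( \frac{(n_{+,T} - 1) \cdot b}{n_{0,T} + 1} \right)^{n_{1,T}} \notag \\
& \cdot \prod_{g_1} \left( \frac{1}{N_T(g_1) - 1} \right)^{N_T(g_1)} \cdot \prod_{g_2} \left( \frac{1}{N_T(g_2) - 1} \right)^{N_T(g_2)}, \tag{67}
\end{align}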



where $n_{1,T}$ is the number of pairs $(g_1, g_2)$ seen exactly once in $T$ (i.e. the 
number of pairs that will be unseen when used as "held-out" part). Taking 
the logarithm, one obtains the final optimisation criterion $F''_{LO}$: 

\begin{align}
F''_{LO} = {} & \sum_{g_1, g_2 :\, N_T(g_1, g_2) > 1} N_T(g_1, g_2) \log\left( N_T(g_1, g_2) - 1 - b \right) + n_{1,T} \log\left( \frac{b \cdot (n_{+,T} - 1)}{n_{0,T} + 1} \right) \tag{68} \\
& - \sum_{g_1} N_T(g_1) \log\left( N_T(g_1) - 1 \right) - \sum_{g_2} N_T(g_2) \log\left( N_T(g_2) - 1 \right). \notag
\end{align}
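
The closed-form criterion can be checked numerically against a direct leave-one-out computation that removes each training event in turn. The sketch below is illustrative only: the names and data layout are assumptions, absolute discounting is taken in its standard form, and every class is assumed to occur at least twice and more than one class pair to be seen, so that no logarithm of zero arises. The two quantities differ by N log(N - 1), a term independent of the classification.

```python
import math
from collections import Counter

def class_counts(events, G1, G2):
    n12 = Counter((G1[h], G2[w]) for h, w in events)
    n1 = Counter(G1[h] for h, _ in events)
    n2 = Counter(G2[w] for _, w in events)
    return n12, n1, n2

def loo_criterion(events, G1, G2, num_g1, num_g2, b=0.75):
    """Closed-form leaving-one-out criterion from counts on the full corpus."""
    n12, n1, n2 = class_counts(events, G1, G2)
    n_plus = len(n12)                          # pairs seen at least once
    n_zero = num_g1 * num_g2 - n_plus          # pairs never seen
    n_one = sum(1 for n in n12.values() if n == 1)
    value = sum(n * math.log(n - 1 - b) for n in n12.values() if n > 1)
    if n_one:
        value += n_one * math.log(b * (n_plus - 1) / (n_zero + 1))
    value -= sum(n * math.log(n - 1) for n in n1.values())
    value -= sum(n * math.log(n - 1) for n in n2.values())
    return value

def loo_direct(events, G1, G2, num_g1, num_g2, b=0.75):
    """Direct leave-one-out sum: hold out each event and score it on the rest."""
    n12, n1, n2 = class_counts(events, G1, G2)
    n_plus, rest = len(n12), len(events) - 1
    n_zero = num_g1 * num_g2 - n_plus
    total = 0.0
    for h, w in events:
        g1, g2 = G1[h], G2[w]
        c = n12[g1, g2] - 1                    # pair count without this event
        if c > 0:
            p12 = (c - b) / rest
        else:                                  # the pair becomes unseen
            p12 = (n_plus - 1) * b / ((n_zero + 1) * rest)
        p1 = (n1[g1] - 1) / rest
        p2 = (n2[g2] - 1) / rest
        total += math.log(p12 / (p1 * p2))
    return total

words = "a b a c b a b c a b".split()
events = [((words[i - 1],), words[i]) for i in range(1, len(words))]
G1 = {("a",): 0, ("b",): 0, ("c",): 1}
G2 = {"a": 0, "b": 1, "c": 1}
closed = loo_criterion(events, G1, G2, 2, 2)
direct = loo_direct(events, G1, G2, 2, 2)
```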